The focus of this lab is on developing spatial weights matrices and computing spatial autocorrelation statistics. The dataset for this lab contains data on the 2004 (Bush vs. Kerry) Presidential election. Here are the variables in our election dataset (Bush_pct is our variable of interest):
## [1] "NAME" "STATE_NAME" "STATE_FIPS" "CNTY_FIPS" "FIPS"
## [6] "AREA" "FIPS_num" "Bush" "Kerry" "County_F"
## [11] "Nader" "Total" "Bush_pct" "Kerry_pct" "Nader_pct"
## [16] "MDratio" "hosp" "pcthisp" "pcturban" "urbrural"
## [21] "pctfemhh" "pcincome" "pctpoor" "pctlt9ed" "pcthsed"
## [26] "pctcoled" "unemploy" "pctwhtcl" "homevalu" "rent"
## [31] "popdens" "crowded" "ginirev" "SmokecurM" "SmokevrM"
## [36] "SmokecurF" "SmokevrF" "Obese" "Noins" "XYLENES__M"
## [41] "TOLUENE" "TETRACHLOR" "STYRENE" "NICKEL_COM" "METHYLENE_"
## [46] "MERCURY_CO" "LEAD_COMPO" "BENZENE__I" "ARSENIC_CO" "POP2000"
## [51] "POP00SQMIL" "MALE2000" "FEMALE2000" "MAL2FEM" "UNDER18"
## [56] "AIAN" "ASIA" "BLACK" "NHPI" "WHITE"
## [61] "AIAN_MORE" "ASIA_MORE" "BLK_MORE" "NHPI_MORE" "WHT_MORE"
## [66] "HISP_LAT" "CH19902000" "MEDAGE2000" "PEROVER65"
The first plot illustrates the percentage of votes for Bush in each county of the USA. Based on this plot, we can conclude that in most counties the percentage of votes for Bush is higher than 50%. The second plot compares the winner of each county, and it is quite obvious that Bush won most of the counties.
## Warning in merge.data.frame(new_map, data.election@data, by.x = "id", by.y
## = "FIPS"): column name 'id' is duplicated in the result
Contiguity matrix: In this part, we create different contiguity and weight matrices to describe spatial relationships.
The number of neighbors per county is not homogeneous across the USA. We can divide the country into two parts: the east, with small counties, and the west, with much larger counties. A single contiguity matrix may therefore behave completely differently in the east and the west. For our variable, we decided to choose methods that are efficient for irregularly-spaced data. For the spatial analysis of the election dataset, the three contiguity structures that are most effective at preserving spatial relationships for spatial autocorrelation analysis are direct contiguity (queen method), 6 nearest neighbors, and sphere of influence. The other methods do not work as well because they cannot model the east and west parts of the USA in a similar and appropriate way; that is, we get irrelevant connections in the east while connections in the west are underestimated. In the three methods above, relationships are modeled homogeneously, and neither the area of the counties nor the distance between neighboring counties distorts the contiguity matrix. Because we should choose only one contiguity structure as the best for our analysis, we chose direct contiguity (queen method) for several reasons: 1) it does not include second- or higher-order neighbors in the contiguity matrix (modeling higher-order neighbors may be appropriate in the east, with its small counties and the pattern we see in the election, but is not acceptable in the west); 2) there are no neighborless units; 3) it is efficient for irregularly-spaced data; 4) there are no irrelevant connections in the contiguity structure; 5) the weight matrix is less noisy.
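To make the queen rule concrete, here is a minimal base-R sketch (a toy 3 x 3 grid, not the county polygons) of the kind of binary contiguity matrix that spdep's `poly2nb(queen = TRUE)` followed by `nb2listw(style = "W")` would produce for real data. The grid, the loop, and the variable names are all illustrative assumptions.

```r
# Toy sketch: queen contiguity on a 3 x 3 grid of cells (assumed stand-in
# for county polygons). Queen rule: cells sharing an edge OR a corner
# are neighbors.
n <- 3
coords <- expand.grid(col = 1:n, row = 1:n)  # cell centers
W <- matrix(0, n * n, n * n)
for (i in 1:(n * n)) for (j in 1:(n * n)) {
  if (i != j &&
      abs(coords$row[i] - coords$row[j]) <= 1 &&
      abs(coords$col[i] - coords$col[j]) <= 1) W[i, j] <- 1
}
rowSums(W)            # corner cells have 3 neighbors, edge cells 5, center 8
Wr <- W / rowSums(W)  # row-standardized ("W" style) weights; each row sums to 1
```

Even on this tiny grid the neighbor counts are uneven (3, 5, or 8), which is the same homogeneity concern, at a much larger scale, that drives the choice of contiguity structure for the counties.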
##
## Moran I test under randomisation
##
## data: data.election$Bush_pct
## weights: W_cont_mat
##
## Moran I statistic standard deviate = 51.731, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.5565174275 -0.0003219575 0.0001158676
##
## Geary C test under randomisation
##
## data: data.election$Bush_pct
## weights: listw2U(W_cont_mat)
##
## Geary C statistic standard deviate = 50.359, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic Expectation Variance
## 0.4204068521 1.0000000000 0.0001324606
## Warning in globalG.test(data.election$Bush_pct, listw = W_cont_mat,
## zero.policy = T): Binary weights recommended (especially for distance
## bands)
##
## Getis-Ord global G statistic
##
## data: data.election$Bush_pct
## weights: W_cont_mat
##
## standard deviate = 22.859, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Global G statistic Expectation Variance
## 3.299492e-04 3.211300e-04 1.488503e-13
## Joincount Expected Variance z-value
## 0:0 130.7534 54.0599 6.7744 29.466
## 1:1 1111.0765 1030.8162 12.6147 22.598
## 1:0 311.6701 472.6272 29.4787 -29.645
## Jtot 311.6701 472.6272 29.9470 -29.413
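The Moran's I statistic reported by `moran.test` above can be computed directly from its definition, I = (n / S0) * Σᵢⱼ wᵢⱼ(xᵢ - x̄)(xⱼ - x̄) / Σᵢ(xᵢ - x̄)², with expectation -1/(n - 1) under no autocorrelation. A minimal base-R sketch on toy data (the five-area line and its path weights are assumptions, not the election weights):

```r
# Toy sketch: Moran's I from its definition, on 5 areas along a line.
x <- 1:5                                         # smoothly varying values
W <- matrix(0, 5, 5)
for (i in 1:4) W[i, i + 1] <- W[i + 1, i] <- 1   # binary path ("rook") neighbors
n  <- length(x)
d  <- x - mean(x)                                # deviations from the mean
S0 <- sum(W)                                     # total weight
I  <- (n / S0) * sum(W * outer(d, d)) / sum(d^2)
I                                                # 0.5: neighbors have similar values
-1 / (n - 1)                                     # expectation under randomness: -0.25
```

A positive gap between the statistic and its expectation, as with 0.5565 vs. -0.0003 for Bush_pct above, is what the z-score and p-value formalize.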
Local Moran's I: We used this measure to investigate spatial clusters of features with high or low values. It calculates a value for each feature in the dataset. One of the outputs of the localmoran function is a Z value for each feature, which indicates spatial clustering of high or low values. A high positive Z value indicates a feature surrounded by features with similar values (either high or low), while a low negative Z value indicates an outlier. We should keep in mind that a feature with a high value is not necessarily a statistically significant hot spot; a feature is a statistically significant hot spot only when it is surrounded by other high values as well. We mapped both the value of Local Moran's I and the Z value in two different maps. The map of Z values shows mostly high, positive Z values, which means we have clusters, and from the map of Local Moran's I we can conclude that these hot spots result from high values surrounded by high values. A few outliers are also visible in the Z value map. We also created a LISA cluster map that clearly shows the low-low, low-high, high-low, and high-high clusters.
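The per-feature statistic behind this, Iᵢ = (xᵢ - x̄)/m₂ · Σⱼ wᵢⱼ(xⱼ - x̄) with m₂ = Σₖ(xₖ - x̄)²/n, is what `spdep::localmoran()` reports in its first column. A base-R sketch on assumed toy data (same five-area line as before, not the county weights):

```r
# Toy sketch: local Moran's I for each of 5 areas on a line.
# Positive I_i: area i resembles its neighbors (high-high or low-low);
# negative I_i: area i is a local outlier (high-low or low-high).
x <- 1:5
W <- matrix(0, 5, 5)
for (i in 1:4) W[i, i + 1] <- W[i + 1, i] <- 1
Wr <- W / rowSums(W)                     # row-standardized weights
d  <- x - mean(x)
m2 <- sum(d^2) / length(x)
Ii <- d / m2 * as.vector(Wr %*% d)
Ii                                       # 1.0 0.5 0.0 0.5 1.0: both ends cluster
mean(Ii)                                 # with row-standardized W this equals
                                         # the global Moran's I (0.6 here)
```

The local values averaging back to the global statistic is why a map of Iᵢ decomposes the global clustering into the specific high-high and low-low counties.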
Local Getis-Ord G*: Here both high positive and high negative Z values can be considered statistically significant. Large positive Z values indicate that high values are clustered, forming a hot spot; large negative Z values indicate that low values are clustered, forming a cold spot. The magnitude of Z also shows the intensity of the cluster. In fact, according to the R help for localG, "High positive values indicate the possibility of a local cluster of high values of the variable being analysed, very low relative values a similar cluster of low values". In our results, we can see various hot spots across the country. Cold spots occur in smaller portions of the country, mostly in NM, CO, MN, ME, NH, MA, and MS.
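The z-score that `spdep::localG()` returns (in the G* form, where the weights include each area itself) follows the Ord and Getis (1995) standardization: zᵢ = (Σⱼ wᵢⱼxⱼ - x̄Wᵢ) / (s·√((nS1ᵢ - Wᵢ²)/(n-1))), with Wᵢ = Σⱼ wᵢⱼ and, for binary weights, S1ᵢ = Wᵢ. A base-R sketch on assumed toy data:

```r
# Toy sketch: Getis-Ord G*_i z-scores for 5 areas on a line, with low
# values on the left and high values on the right. Positive z -> hot spot,
# negative z -> cold spot; the self-value is included (the "*" in G*).
x <- c(1, 1, 1, 9, 9)
B <- diag(5)                                    # include self
for (i in 1:4) B[i, i + 1] <- B[i + 1, i] <- 1  # plus path neighbors
n    <- length(x)
xbar <- mean(x)
s    <- sqrt(mean(x^2) - xbar^2)                # population sd, per Ord & Getis
Wi   <- rowSums(B)                              # W_i; binary weights so S1i = W_i
z    <- (B %*% x - xbar * Wi) / (s * sqrt((n * Wi - Wi^2) / (n - 1)))
round(as.vector(z), 3)                          # negative at the low end,
                                                # positive at the high end
```

On this toy line the low end comes out as a cold spot and the high end as a hot spot, mirroring how the county map separates hot spots from the NM/CO/MN/ME/NH/MA/MS cold spots.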
Here I fitted a linear model and computed spatial autocorrelation statistics on the residuals. The plot of residuals shows that they are strongly correlated with the dependent variable (Bush_pct). I also ran a spatial lag regression on the same parameters. In the linear model all of the independent variables are significant and the model p-value is very small, so I can reject the null hypothesis; in the spatial lag regression, one of the independent variables (rent) is no longer significant, and the LM test for residual autocorrelation has a p-value of 0.31536, so I cannot reject the null hypothesis of no remaining residual autocorrelation.
##
## Call:
## lm(formula = Bush_pct ~ pcturban + pctpoor + rent + BLACK, data = data.election)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.027 -7.492 0.608 8.377 32.494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.026712 1.002528 73.840 < 2e-16 ***
## pcturban -0.045997 0.008583 -5.359 8.97e-08 ***
## pctpoor -0.149748 0.029474 -5.081 3.98e-07 ***
## rent -0.021990 0.002628 -8.366 < 2e-16 ***
## BLACK -0.276413 0.015488 -17.847 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.64 on 3106 degrees of freedom
## Multiple R-squared: 0.1687, Adjusted R-squared: 0.1677
## F-statistic: 157.6 on 4 and 3106 DF, p-value: < 2.2e-16
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = Bush_pct ~ pcturban + pctpoor + rent + BLACK,
## data = data.election)
## weights: W_soi_mat
##
## Moran I statistic standard deviate = 52.992, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.5887375253 -0.0010309600 0.0001238647
##
## Call:lagsarlm(formula = Bush_pct ~ pcturban + pctpoor + rent + BLACK,
## data = data.election, listw = W_soi_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.20453 -4.43668 0.61451 5.23806 27.68438
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 24.1536105 1.2692688 19.0295 < 2.2e-16
## pcturban -0.0593983 0.0061530 -9.6535 < 2.2e-16
## pctpoor -0.1571981 0.0214897 -7.3150 2.573e-13
## rent -0.0013272 0.0018925 -0.7013 0.4831
## BLACK -0.1484095 0.0125471 -11.8282 < 2.2e-16
##
## Rho: 0.70705, LR test value: 1678.1, p-value: < 2.22e-16
## Asymptotic standard error: 0.014156
## z-value: 49.946, p-value: < 2.22e-16
## Wald statistic: 2494.7, p-value: < 2.22e-16
##
## Log likelihood: -11208.88 for lag model
## ML residual variance (sigma squared): 69.497, (sigma: 8.3365)
## Number of observations: 3111
## Number of parameters estimated: 7
## AIC: 22432, (AIC for lm: 24108)
## LM test for residual autocorrelation
## test value: 1.0081, p-value: 0.31536
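The model `lagsarlm` fits above is y = ρWy + Xβ + ε, whose reduced form is y = (I - ρW)⁻¹(Xβ + ε). A base-R sketch on simulated toy data (the five-area weights, β, and ρ = 0.7, close to the fitted Rho, are all assumptions; this is the model structure, not the ML fitting code):

```r
# Toy sketch: constructing data from the spatial lag model's reduced form
# and verifying it satisfies  y = rho * W y + X beta + eps  exactly.
set.seed(1)
n    <- 5
W    <- matrix(0, n, n)
for (i in 1:(n - 1)) W[i, i + 1] <- W[i + 1, i] <- 1
W    <- W / rowSums(W)                             # row-standardized weights
rho  <- 0.7                                        # assumed, near the fitted Rho
X    <- cbind(1, rnorm(n))                         # intercept + one covariate
beta <- c(2, -1)
eps  <- rnorm(n)
y    <- solve(diag(n) - rho * W, X %*% beta + eps) # reduced form
gap  <- max(abs(y - (rho * W %*% y + X %*% beta + eps)))
gap                                                # ~ 0: the identity holds
```

Because y on the left also appears in Wy on the right, OLS on such data is biased, which is why the significant Moran test on the lm residuals above points toward the lag specification.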
All of the metrics show that high values are spatially clustered. One difference I noticed between these two metrics is the way their results are interpreted. With Global Moran's I, when the p-value is statistically significant the null hypothesis can be rejected, and based on the Moran's I statistic and the Z score we can conclude that the spatial distribution of high and/or low values is clustered (Moran's I close to one and a positive Z score) or dispersed (Moran's I close to negative one and a negative Z score). With Getis-Ord General G, when the p-value is statistically significant the null hypothesis can be rejected, and based on the Z score we can conclude that a positive Z value means high values are more spatially clustered than expected, while a negative Z value means low values are more spatially clustered than expected (http://help.arcgis.com/En/Arcgisdesktop/10.0/Help/index.html#//005p0000000q000000).
The Getis-Ord General G method works well only when values are evenly distributed. When both low-value and high-value clusters are present, the method does not work properly because the low and high clusters cancel each other out. Although the dominant clusters in our dataset are of high values, we can still see low-value clusters on our map, so Global Moran's I is the better choice for our dataset.
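The statistic behind `globalG.test` is G = Σᵢ≠ⱼ wᵢⱼxᵢxⱼ / Σᵢ≠ⱼ xᵢxⱼ, with expectation S0/(n(n-1)) under no clustering, so only the spatial arrangement of the same values moves it. A base-R sketch on assumed toy data showing why clustered high values push G above its expectation and interleaved ones pull it below:

```r
# Toy sketch: global Getis-Ord G for two arrangements of the same kinds
# of values on a 5-area line with binary path weights.
W <- matrix(0, 5, 5)
for (i in 1:4) W[i, i + 1] <- W[i + 1, i] <- 1
G <- function(x, W) sum(W * outer(x, x)) / (sum(x)^2 - sum(x^2))
g_clustered   <- G(c(1, 1, 1, 9, 9), W)   # high values adjacent
g_interleaved <- G(c(9, 1, 9, 1, 9), W)   # high values separated
g_expected    <- sum(W) / (5 * 4)         # S0 / (n (n - 1)) = 0.4
c(g_clustered, g_interleaved, g_expected)
```

Since low-value clusters drag G down while high-value clusters push it up, a map containing both can net out near the expectation, which is the cancellation problem described above.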
Although Local Getis-Ord G* and local Moran's I are both used to find clusters, they have some key differences. The output of the localmoran function contains both the value of local Moran's I and the Z value, whereas the output of localG is only a Z value. Furthermore, the interpretation of the Z values in the two measures is completely different. In local Moran's I, a positive Z score means a feature has neighbors with similar values, and a negative Z score means the feature is an outlier. In Local Getis-Ord G*, high positive Z values indicate clustering of high values and high negative Z values indicate clustering of low values, so Local Getis-Ord G* cannot be used to identify outliers. This difference exists because in local Moran's I the value of each feature is not included in the analysis of that feature, while in Local Getis-Ord G* it is. Getis-Ord G* cannot be a perfect measure of local autocorrelation in our case because each feature has only about 6 neighbors in the weight matrix, so a feature with a very high value surrounded by low values might still be identified as a hot spot. Getis-Ord G* would therefore be appropriate only if we increased the size of the neighborhoods.
```r
library(knitr)
rmarkdown::render("ghandehari_lab7.Rmd")
```